Creating a Dual-Purpose Treebank

Authors

  • Eiríkur Rögnvaldsson
  • Anton Karl Ingason
  • Einar Freyr Sigurðsson
  • Joel Wallenberg
Abstract

We describe the background for and building of IcePaHC, a one-million-word parsed historical corpus of Icelandic which has just been finished. This corpus, which is completely free and open, contains fragments of 60 texts ranging from the late 12th century to the present. We describe the text selection and text collecting process and discuss the quality of the texts and their conversion to modern Icelandic spelling. We explain why we chose to use a Penn-style phrase structure annotation scheme and briefly describe the syntactic annotation process. Furthermore, we advocate the importance of an open source policy as regards language resources.

Similar resources

Creating a Tree Adjoining Grammar from a Multilayer Treebank

We propose a method for the extraction of a Tree Adjoining Grammar (TAG) from a dependency treebank which has some representative examples annotated with phrase structures. We show that the resulting TAG along with corresponding dependency structure can be used to convert a dependency treebank to a TAG-based phrase structure treebank.

ITU Treebank Annotation Tool

In this paper, we present a treebank annotation tool developed for processing Turkish sentences. The tool consists of three different annotation stages: morphological analysis, morphological disambiguation and syntax analysis. Each of these stages is integrated with existing analyzers in order to guide human annotators. Our semiautomatic treebank annotation tool is currently used both for crea...

Hindi CCGbank: CCG Treebank from the Hindi Dependency Treebank

In this paper, we present an approach for automatically creating a Combinatory Categorial Grammar (CCG) treebank from a dependency treebank for the Subject-Object-Verb language Hindi. Rather than a direct conversion from dependency trees to CCG trees, we propose a two-stage approach: a language-independent generic algorithm first extracts a CCG lexicon from the dependency treebank. A determinis...

Domain Adaptation with Artificial Data for Semantic Parsing of Speech

We adapt a semantic role parser to the domain of goal-directed speech by creating an artificial treebank from an existing text treebank. We use a three-component model that includes distributional models from both target and source domains. We show that we improve the parser’s performance on utterances collected from human-machine dialogues by training on the artificially created data without l...

Creating a Dependency Syntactic Treebank: Towards Intuitive Language Modeling

In this paper we present a user-centered approach for defining the dependency syntactic specification for a treebank. We show that by collecting information on syntactic interpretations from the future users of the treebank, we can model so far dependency-syntactically undefined syntactic structures in a way that corresponds to the users’ intuition. By consulting the users at the grammar defini...

Journal title:
  • JLCL

Volume 26

Publication date: 2011